Same-gender citations do not indicate a substantial gender homophily bias

Can the male citation advantage (more citations for papers written by male than female scientists) be explained by gender homophily bias, i.e., the preference of scientists to cite other scientists of the same gender category? Previous studies report much evidence that this is the case. However, the observed gender homophily bias may be overestimated by overlooking structural aspects such as the gender composition of research topics in which scientists specialize. When controlling for research topics at a high level of granularity, there is only little evidence for a gender homophily bias in citation decisions. Our study points out the importance of controlling structural aspects such as gendered specialization in research topics when investigating gender bias in science.


S1.1. Dataset
For the analyses presented in Fig. 2, Fig. 4, and Figs 5A-5E, we used papers that are included in the Faculty Opinions database and for which we could merge the necessary metadata from the Web of Science (WoS) database. The Faculty Opinions database initially contains information for 162,071 papers. Due to missing metadata and restrictions in our analyses, we could not include all these papers in our analyses. Fig. S1 illustrates how many papers had to be excluded at each step in the data preprocessing phase. Most papers had to be excluded due to insufficient information on the authors' gender (for at least one author of a paper, the gender could not be inferred). We decided to use only reliable gender information, although this reduced the number of papers that could be included in the analyses. For a large share of authors, the gender could not be inferred because no first name or only the initials of the authors' first name(s) are given. Of all authors with missing gender information, 1.9% have no author name provided in the WoS, 28.6% have only one letter given as first name string, and another 19.3% have only two letters given as first name string. This suggests that for at least one-third of the authors the missing gender information is due to a missing full first name: for almost all cases with only one letter (and for most cases with two letters) given, it can be assumed that these are the first names' initials.
On the paper level, we labelled the gender as missing if at least one author has missing gender information. Among the papers with missing gender information, 36.7% have at least one author with missing name and 73.5% at least one author with unisex name (see S1.2 for more information on how the authors' gender was classified).
Further analyses show that information on gender is missing for papers with older publication dates in particular. Papers with sufficient gender information have on average been published in 2012, while papers with insufficient gender information have on average been published in 2011. This can be expected due to the increasing availability of full first names in the WoS over time. Since even one author with missing gender information results in no gender being assigned to a paper, papers with multiple authors are more likely to have missing gender information. This is also reflected in our data: papers that could be assigned a gender category were written on average by fewer authors (arithmetic mean: 6; median: 5) than papers that could not be assigned a gender category (arithmetic mean: 10; median: 7). Papers without gender information also have more citations on average (arithmetic mean: 126; median: 57) than papers with gender information (arithmetic mean: 92; median: 43). However, this difference shrinks and reverses (papers without gender information receive on average six fewer citations than papers with gender information) once the publication year and number of authors are controlled for. This means that, independently of publication year and number of authors, missing gender information is only slightly related to citation counts.
In our regression analyses (models M1-M3), we compared male- and mixed-authored with female-authored focal papers. Including mixed-authored papers as well is one advantage of this approach (besides the possibility to use control variables). For mixed-authored papers, the share of male-authored citing papers is larger than for female-authored papers, but smaller than for male-authored papers; thus, dropping mixed-authored papers (as has been done in other analyses) yields an upper bound of gender effects on citations. We were able to include 38,439 papers in our analyses (few cases were lost due to missing information on the Faculty Opinions keywords that we used as controls, see Fig. S1). We controlled for the Faculty Opinions keywords by single dummy variables. These dummies measure topic similarity only at a low level of granularity: we controlled for single keywords, but not for idiosyncratic combinations of keywords that would define research topics more precisely (but see S2.2.1 for some robustness analyses with regressions based on pairs of papers that allow for more fine-grained controls of papers' similarity). Therefore, we complemented regressions including single dummy variables with a novel approach based on pairs of papers. The alternative approach allowed us to control the papers' topic not only at a more fine-grained level, but also at different levels of granularity. For these analyses presented in Figs. 4-5, we paired female-authored with male-authored papers. We could not include mixed-authored focal papers in the analyses. While pairing the focal papers increases the number of observations (different pairs of papers are observed instead of single papers), it reduces the number of focal papers that we could consider due to the necessary omission of mixed-authored papers. Of the 10,541 papers that we included in the analyses based on pairs of papers, 1,261 have only female authors and 9,280 only male authors.
Building all possible pairs of papers such that one paper is authored only by female scientists and the other paper is authored only by male scientists results in 11,702,080 pairs that we could use for plotting the histograms in Fig. 4.
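The pairing step can be illustrated with a minimal Python sketch. It is not the code used for the analyses; the function name and the toy identifiers are hypothetical. It only shows that crossing every female-authored paper with every male-authored paper yields a number of pairs equal to the product of the two group sizes, which reproduces the 11,702,080 pairs reported above.

```python
from itertools import product


def build_pairs(female_ids, male_ids):
    """Return every (female-authored, male-authored) pair of focal papers."""
    return list(product(female_ids, male_ids))


# Toy identifiers (hypothetical, not taken from the dataset).
toy_pairs = build_pairs(["f1", "f2"], ["m1", "m2", "m3"])
n_toy_pairs = len(toy_pairs)

# The number of pairs is the product of the two group sizes
# (1,261 female-authored and 9,280 male-authored focal papers).
n_reported_pairs = 1_261 * 9_280
```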
In the analysis presented in Fig. 5F based on WoS data, we considered all papers that are included in the WoS, have been published in 2012, and are of document type 'article' or 'review' (to include only substantial papers). This comprises 1,396,207 papers in total. Due to missing metadata, the set reduces to 1,235,021 papers. For 1,212,097 of the papers, the title and abstract are available in the WoS. The gender of all authors could be classified for 399,319 papers (117,143 with only female authors and 282,176 with only male authors). When matching the pairs, we only considered pairs for which both papers are in the same WoS subject category. This drastically reduces the computational complexity and makes the results more comparable to those based on the Faculty Opinions data: the papers in the Faculty Opinions database are field-specifically restricted (to biomedicine). We were able to build 508,941,740 pairs of papers with one being female-authored and the other being male-authored.

S1.2. Determining gender of names
In order to infer the authors' gender (of both focal papers and their citing papers), we used an open source application that makes it possible to assign the gender to first names [1]. This application is based on a list of 44,568 first names that have been mapped to a gender, depending on the country of origin. In order to consider the country of origin when inferring the authors' gender, we used the authors' affiliation as a proxy. In the case of unisex names, the application returns no gender to a given first name (in combination with a country of origin): the name may refer to female, male, or non-binary persons. If a first name is usually associated with a particular gender in a country, but is also used for the other gender in another country, the application classifies this name as probably female/male. These cases are included in our analyses, i.e. a probably female (male) author is assumed to actually be female (male). Fig. S6 shows the results for the pair-based analyses using the Faculty Opinions data when probably female/male authors are excluded and only the more reliable gender assignments are used. Since these results do not differ substantially from the results obtained when including the less reliable gender assignments, we conclude that both approaches can be used interchangeably for our analyses.
In order to operationalize the gender of author teams, we distinguished three cases: all authors are female, all authors are male, and the authors are of mixed gender. If we could not infer the gender of at least one author (because it is a unisex name or the name is not in the application's database), the paper is not included in our analyses. If multiple affiliations in different countries are linked to an author, we determined the gender of the author separately for each affiliation. If the gender classifications match, the paper remains in the analyses; if there are inconsistencies across the different classifications, the paper is not included in the analyses.
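The rules in this section can be summarized in a short Python sketch. It is an illustration under assumptions, not the implementation used in the study: the function names are hypothetical, and each author is assumed to arrive as a list of gender labels, one per affiliation country, with "probably female/male" already mapped to "female"/"male" and `None` standing for a name that could not be classified.

```python
def author_gender(classifications):
    """Combine one author's per-affiliation labels.

    Returns 'female' or 'male' if all labels agree, otherwise None
    (missing label, inconsistent labels, or no label at all).
    """
    labels = set(classifications)
    if len(labels) != 1 or None in labels:
        return None
    return labels.pop()


def team_gender(authors):
    """Classify a paper from its authors' per-affiliation labels.

    Returns 'female', 'male', 'mixed', or None if the paper is excluded.
    """
    genders = [author_gender(labels) for labels in authors]
    if not genders or None in genders:
        return None  # at least one author could not be classified
    unique = set(genders)
    if unique == {"female"}:
        return "female"
    if unique == {"male"}:
        return "male"
    return "mixed"
```

For example, `team_gender([["female"], ["male", "male"]])` gives `"mixed"`, while a paper with one author classified as male in one country and female in another is dropped.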

S1.3. Regression analyses
We tested the regression models presented in Fig. 2 and Table 1 for heteroscedasticity, multicollinearity, and outliers. To test for heteroscedasticity, we performed Breusch-Pagan tests [2]. The tests are statistically significant at the 0.001 level for all three models, indicating that the variance of the error terms depends on the values of the independent variables. Therefore, we used robust standard errors for calculating p values in Table 1 and confidence intervals in Fig. 2. To test for multicollinearity, we calculated the variance inflation factor (VIF) for the independent variables in all models. For the dummy variables indicating the gender of the focal papers' authors, we calculated the generalized VIF (GVIF) [3]. The GVIF accounts for the fact that the two dummy variables represent the same characteristic. Since the VIF/GVIF is smaller than five in every case, we assume a negligible level of multicollinearity in the models [4]. To test for outliers in our data, we calculated Cook's distance for all observations. Since all values are smaller than 0.5, we do not assume any problematic effects of outliers in our analyses [5].

S1.4. Similarity based on titles and abstracts
The similarity between two papers based on their titles and abstracts was calculated using the term frequency-inverse document frequency (tf-idf) of the words (terms) occurring in the abstracts and titles (documents). The tf-idf is a standard approach to obtain vector representations of documents. It indicates the relevance of each word of a document within the collection of all documents [6]. We used the R package text2vec for calculating the tf-idf. For each paper, title and abstract were simply concatenated, and the words on the list of English stop-words provided by the R package stopwords were excluded. Stop-words are the most common words in the English language, such as "the," "is," "which" etc. These words do not add substantial meaning to a text and therefore should be filtered out before text mining procedures. Furthermore, we excluded very infrequent and very frequent words to remove noise: a word was excluded if it occurs fewer than three times over all documents, or if the proportion of documents including the word is larger than 0.3. The tf-idf of a word t in a document d from the collection D is the product of two terms: the frequency of t in d, and a weight that is larger the rarer t is in D. This rarity is measured in the second term by the logarithmized inverse frequency of documents in D which contain the word t.
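The preprocessing and weighting described in this paragraph can be sketched in Python as follows. This is a simplified illustration, not the text2vec pipeline used for the analyses: the stop-word list is a tiny hypothetical stand-in, the function names are invented, and the plain weighting tf · log(N / df) is an assumption (the exact smoothing and normalization in text2vec may differ).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the study used the list
# from the R package stopwords).
STOP_WORDS = {"the", "is", "which", "a", "of", "and", "in"}


def tokenize(text):
    """Lowercase, split into alphabetic words, and drop stop-words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf_vectors(documents, min_count=3, max_doc_share=0.3):
    """Return one sparse {word: tf-idf} dict per document.

    Words occurring fewer than `min_count` times overall, or in more than
    `max_doc_share` of the documents, are removed before weighting.
    """
    tokens = [tokenize(doc) for doc in documents]
    total_count = Counter(w for doc in tokens for w in doc)
    doc_freq = Counter(w for doc in tokens for w in set(doc))
    n_docs = len(documents)
    vocabulary = {
        w for w in total_count
        if total_count[w] >= min_count and doc_freq[w] / n_docs <= max_doc_share
    }
    vectors = []
    for doc in tokens:
        counts = Counter(w for w in doc if w in vocabulary)
        # Term frequency times logarithmized inverse document frequency.
        vectors.append({w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()})
    return vectors
```

Each "document" passed to `tfidf_vectors` would be the concatenated title and abstract of one paper.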
To measure the similarity between two papers, the cosine similarity between their tf-idf vectors was used. Cut-off values were needed to define different levels of similarity for which we plotted the histograms on the differences in the share of male-authored citing papers between male- and female-authored focal papers. We defined these cut-off values in a way that maximizes the comparability with the approach based on the keywords from the Faculty Opinions database to define similarity. To achieve this, we set the cut-off values so that the share of pairs that are classified as similar at each level is equal to the share of pairs that are classified as similar at the same level when using the Faculty Opinions keywords to define similarity. For instance, at the first level of cosine similarity, 6.5% of pairs are included, because 6.5% of pairs have one shared Faculty Opinions keyword.
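A minimal Python sketch of the two steps in this paragraph is given below, under stated assumptions: tf-idf vectors are represented as sparse `{word: weight}` dictionaries, both function names are hypothetical, and the cut-off is taken as the similarity of the pair at the rank that corresponds to the target share (with tied similarity values, slightly more pairs than the target share can end up at or above the cut-off).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {word: weight} dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def cutoff_for_share(similarities, share):
    """Similarity value such that about `share` of all pairs lie at or above it."""
    ranked = sorted(similarities, reverse=True)
    k = max(1, round(share * len(ranked)))
    return ranked[k - 1]
```

With this helper, a call such as `cutoff_for_share(all_pair_similarities, 0.065)` would return the first-level cut-off for the 6.5% example mentioned above, where `all_pair_similarities` is a hypothetical list holding one cosine similarity per pair.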

S2.1. Operationalization and results of other studies
Existing studies on gender homophily in citations used different methods and datasets.
Table S1 summarizes these studies and shows the main differences with regard to the data and methods used, as well as their results summarized in the gender homophily rate.
Some but not all studies controlled for self-citations by excluding them in their analyses.
To operationalize the authors' gender on the paper level, different methods were applied: while some studies considered all authors, others only considered the first author, the first X authors, and/or the corresponding authors. McElhinny et al. (2003) did not operationalize the authors' gender on the paper level, but analyzed all links between authors and citing authors. The gender homophily rate was generally calculated as the difference in the share of female-authored cited references between female-authored focal papers and male-authored focal papers (for all studies listed in Table S1, the homophily measurement was based on the gender distribution in cited references, whereas our main analyses are based on citing papers). To achieve this measurement, the studies did not calculate the share of female-authored (or male-authored) cited references separately for each focal paper to summarize these shares in a next step, but instead pooled the cited references of all female-authored focal papers (male-authored focal papers). The gender homophily rate reported in Table S1 was then calculated as the share of female-authored cited references of female-authored papers minus the share of female-authored cited references of male-authored papers. Note that for studies that considered all authors of a paper to operationalize the gender on the paper level, mixed author teams belonged neither to the female-authored nor to the male-authored papers but were instead dropped, as was the case in our analyses based on pairing of papers. Fig. S2 shows the gender homophily rate for all studies listed in Table S1.
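The pooled calculation of the gender homophily rate described above can be sketched in Python as follows. The data structure, the field names, and the numbers in the usage example are hypothetical, and the sketch assumes for simplicity that every cited reference is either female-authored or male-authored.

```python
def pooled_female_share(focal_papers):
    """Share of female-authored cited references after pooling the
    references of all given focal papers (not a mean of per-paper shares)."""
    female = sum(p["female_refs"] for p in focal_papers)
    total = sum(p["female_refs"] + p["male_refs"] for p in focal_papers)
    return female / total


def homophily_rate(female_focal, male_focal):
    """Pooled share for female-authored focal papers minus pooled share
    for male-authored focal papers, in percentage points."""
    return 100 * (pooled_female_share(female_focal) - pooled_female_share(male_focal))


# Toy data (hypothetical counts of cited references per focal paper).
female_focal = [{"female_refs": 3, "male_refs": 7}, {"female_refs": 1, "male_refs": 9}]
male_focal = [{"female_refs": 2, "male_refs": 18}]
toy_rate = homophily_rate(female_focal, male_focal)  # pooled shares: 4/20 vs. 2/20
```

Pooling gives focal papers with long reference lists more weight than averaging per-paper shares would.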
All studies restricted their data to a limited number of fields by including only particular [...] papers' titles and abstracts by matching each female-authored paper to the most similar male-authored paper in the same issue of the same journal (using the gender of the first and corresponding authors to operationalize gender on the paper level). Although this approach considers more information to control for the papers' research and subject area than the approaches used in other studies, the ability to identify papers that are similar in their research questions/topics may be limited due to the relatively small number of papers (and thus topics) covered in a journal issue. Their approach also does not make it possible to control the similarity between papers at different levels of granularity, as we did in our analyses.

S2.2. Robustness checks
In order to test the reliability of our results, we performed several robustness checks. The goal of these additional analyses was to test how the results change when different methodological approaches are applied. In general, the robustness checks support our conclusion: only a small degree of gender homophily in citations can be found after controlling for the similarity between papers, while inadequate measures of topic similarity can inflate the observed gender homophily.
If not explicitly stated otherwise, all analyses described in this section are based on Faculty Opinions data.

S2.2.1. Regression analyses using pairs of focal papers
Pairing the papers for the analyses shown in Fig. [...] Table S2. Fig. S3 shows the predicted difference in the share of male-authored citing papers for the different numbers of shared Faculty Opinions keywords while setting the other variables (difference in quality ratings, age, and team size) to zero. Setting these variables to zero means that the difference in the share of male-authored citing papers is predicted for the case that both papers of a pair have the same quality rating, age, and team size. The results indicate a pattern similar to the histograms in Fig. 4: the more Faculty Opinions keywords the paired papers share (i.e., the higher their similarity in research topics), the smaller the difference in the share of male-authored citing papers gets. This confirms the result that the gendered citation patterns found in the Faculty Opinions data (in the sense that overall, male scientists are more likely to cite male-authored papers than female scientists, and vice versa) can in large part be explained by the specialization of male and female scientists in different research topics, but not by differences in quality, age, or team size. However, a high granularity of topic similarity measurement is needed (based on combinations of keywords) in order to identify the large impact of this structural aspect explaining gendered citation patterns.

S2.2.2. Including self-citations
For the analyses described in the main text, we excluded all self-citations because self-citations artificially increase the observed degree of gender homophily. We defined self-citations as focal paper-citing paper pairs for which at least one author of the focal paper is identical to at least one author of the citing paper [7]. This is based on the assumption that papers were written by the identical author if the focal paper and the citing paper have at least one common author name, taking into account the first initial and the full surname. It can be expected that in some cases the first initial and the full surname are shared by different persons and therefore the assumption that the authors are the same person is wrong. At the same time, it can be assumed that the name representation rarely differs for a person [8]. Thus, this approach for identifying self-citations is likely to overestimate the number of self-citations, but most self-citations can be assumed to be found. [...] [e.g., 9, 10, 11], these results show that self-citations contribute to gendered citation patterns. The pattern that the difference in the share of male-authored citing papers becomes smaller the better the topic is controlled (i.e., the more the minimum number of shared keywords increases) does not change when self-citations are included in the analyses. At the same time, these extended analyses suggest that evidence for gender homophily is overestimated when self-citations are not excluded (as was the case in some previous studies, see Table S1). In order to validly measure gender homophily as a preference for citing colleagues of the same gender (and not just one's own work), our empirical evidence clearly reveals that this methodological step of excluding self-citations is very important.
When self-citations are not excluded, a much larger average difference in the share of male-authored citing papers remains, which one might erroneously interpret as gender homophily: the difference increases by 7.2 percentage points when matching papers with at least five shared Faculty Opinions keywords (the difference is then 8.83 percentage points, instead of 1.61 when excluding self-citations; see Fig. 4). The larger difference that is driven by self-citations should not be confused with homophily.
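The name-based rule for identifying self-citations defined in this section can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and author names are assumed to be given as "Surname, Given names" strings. The third usage case shows the kind of false positive that makes the rule overestimate self-citations.

```python
def name_key(full_name):
    """Reduce 'Surname, Given names' to (surname, first initial), lowercased."""
    surname, _, given = full_name.partition(",")
    return (surname.strip().lower(), given.strip()[:1].lower())


def is_self_citation(focal_authors, citing_authors):
    """True if the two papers share at least one (surname, first initial) key."""
    focal_keys = {name_key(name) for name in focal_authors}
    return any(name_key(name) in focal_keys for name in citing_authors)


# Same person, different name representation: detected.
same_person = is_self_citation(["Curie, Marie"], ["Curie, M.", "Smith, John"])
# Same surname, different first initial: not a self-citation.
different_initial = is_self_citation(["Curie, Marie"], ["Curie, Pierre"])
# Different persons sharing surname and first initial: false positive.
false_positive = is_self_citation(["Smith, Jane"], ["Smith, John"])
```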

S2.2.3. Using cited references instead of citing papers
Most of the existing studies on gender homophily in citations analyzed references cited in focal papers in order to answer the question whether female-authored focal papers are less likely to refer to male-authored papers compared to male-authored focal papers. Rather than following this approach of analyzing the cited references in focal papers, we decided to use the papers that cited the focal papers in order to analyze the question whether male-authored focal papers are more likely to be cited by male scientists than female-authored focal papers. The major advantage of using citing papers (instead of cited references) is that the publication year of the papers (which is used to measure homophily bias) can be held constant to a greater extent in the analyses. The approach allows for a better standardization of the overall gender composition in science, which might influence gender-specific citation patterns. The gender distribution changed over time, and this time trend may lead to an overestimation of homophily bias: the analysis of cited references can go a long way back in time, to periods when the share of female scientists who could be cited was smaller. [...] 4). The difference between both results suggests that controlling publication year in the analysis is important in order to validly measure gender homophily in citations. In the analysis of citing papers, the publication year of the papers can be held constant to a greater extent than in the analysis of cited references (see above).

S2.2.4. Gender assignments
For a given first name and country (if available), the database that we used for inferring the authors' gender differentiates between "is mostly female/male" and "is female/male". In our main analyses, we used both types of gender classification. Fig. S6 presents empirical results as shown in Fig. 4 but dropping all names with less reliable gender classifications: in the histograms, the difference in the share of male-authored citing papers for pairs of focal papers is shown only for papers with the more reliable gender assignment "is female/male." The results are very similar to using both types of gender classification: the difference in the share of male-authored citing papers decreases from 13.58 to 0.62 percentage points when using the more restrictive gender assignment, while it diminishes from 12.64 to 1.61 percentage points when using the less restrictive assignments including also the "mostly female/male" classification (see again Fig. 4 in our main analyses). We conclude that the reliability of gender assignments indicated by the database does not play a significant role for our results.

S2.2.5. Alternative approaches for controlling similarity
Similarity between papers can be controlled by various approaches. We deem abstracts and titles the most adequate alternative to Faculty Opinions keywords (when expanding the results to papers not included in the Faculty Opinions database). Abstracts and titles usually contain comprehensive information about the content and research topics of papers. Fig. 5B and Fig. 5F in the main text show average differences in the share of male-authored citing papers when using abstracts and titles to match the papers. Fig. S7 and Fig. S8 present the corresponding histograms based on this alternative approach for both Faculty Opinions and WoS data. Besides using titles and abstracts, other possibilities for controlling similarity between papers are the number of shared cited references, the number of shared keywords provided in the WoS, or the number of shared WoS subject categories [12]. WoS keywords include both keywords specified by a paper's authors and keywords automatically generated based on the titles of a paper's cited references [13,14]. Although we used very different approaches for measuring paper similarity, we always found the same pattern: the better the topic similarity between papers is controlled for, the smaller the difference in the share of male-authored citing papers gets. Since the approaches differ with regard to the share of paper pairs that fall into different similarity levels, the approaches cannot be [...] Titles and abstracts allow for a more nuanced measurement of topic similarity, since they allow defining different levels of similarity with a considerable number of pairs that can be matched.
The key take-away of these robustness checks is, however, that thorough controls for research topics are important for identifying genuine gender effects.

S2.2.6. Excluding papers with extreme gender distributions among citing papers
Papers included in the Faculty Opinions data are on average cited more often than other papers in the same field [15]. Thus, there are more focal papers with large citation counts in the Faculty Opinions data (that we used in the main analyses in this study) than in the WoS data.
Having smaller citation counts in the WoS data increases the chance of having 100% female-authored or 100% male-authored citing papers, because there are fewer possible shares of female-authored and male-authored citing papers. For example, a paper with one citation can only have 0% or 100% male-authored (female-authored) citing papers, and a paper with two citations can only have 0%, 50%, or 100% male-authored (female-authored) citing papers (again, excluding mixed-authored papers, to be able to pair only male- and female-authored papers). While papers with large citation counts may have a share of male-authored (female-authored) citing papers close to but less than 100%, most papers with few citations would have a share of male-authored [...] Thus, we observe the same pattern as in other analyses: controlling for the similarity between papers reduces the difference in the share of male-authored citing papers as evidence for gender homophily; and this result is very robust to using alternative sample restrictions (here: including and excluding papers with extreme gender distributions among citing papers).
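The arithmetic behind this argument can be made explicit with a small Python sketch (the function name is hypothetical): a paper with n citing papers can only take n + 1 distinct shares, so the extreme values 0% and 100% make up a larger part of the attainable shares the smaller n is.

```python
from fractions import Fraction


def possible_shares(n_citations):
    """All attainable shares of male-authored citing papers for a focal
    paper with `n_citations` citing papers, as exact fractions."""
    return sorted({Fraction(k, n_citations) for k in range(n_citations + 1)})


shares_one = possible_shares(1)  # 0 and 1, i.e. 0% or 100%
shares_two = possible_shares(2)  # 0, 1/2, and 1, i.e. 0%, 50%, or 100%
```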

S2.2.7. Level of analysis
Using pairs of focal papers may result in certain papers having a stronger influence on the results than other papers. For example, imagine two female-authored papers: paper A, for which there are nine male-authored papers with two shared Faculty Opinions keywords, and paper B, for which there is only one male-authored paper with two shared Faculty Opinions keywords. This means that in the analysis based on all possible pairs with at least two shared Faculty Opinions keywords, nine pairs containing paper A are considered, but only one pair containing paper B. In this scenario, paper A would have a stronger influence on the result than paper B, since 90% of pairs of papers are based on paper A and only 10% on paper B. Extreme values in the share of male-authored citing papers for papers that are included in many pairs would have a great effect on the results. We assume that this should not make a difference, since extreme values can be expected to occur at both ends of the spectrum between a small and a large share of male-authored citing papers. In order to empirically verify our assumption, we performed additional analyses by changing the level of analysis from pairs of focal papers to focal papers. For every male-authored focal paper, we calculated the average difference in the share of male-authored citing papers to all paired female-authored focal papers. This results in one value for each male-authored focal paper, which can be interpreted as the average difference in the share of male-authored citing papers between the male-authored focal paper and its paired female-authored focal papers.
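The change in the level of analysis can be sketched in Python as follows. The function name, the paper identifiers, and the shares in the usage example are hypothetical; the sketch assumes a dictionary that maps each focal paper to its share of male-authored citing papers in percent.

```python
from collections import defaultdict
from statistics import mean


def paper_level_differences(pairs, male_citing_share):
    """Average pair-level differences per male-authored focal paper.

    pairs: iterable of (male_paper_id, female_paper_id) tuples.
    male_citing_share: {paper_id: share of male-authored citing papers in %}.
    Returns {male_paper_id: mean difference to its paired female-authored papers}.
    """
    differences = defaultdict(list)
    for male_id, female_id in pairs:
        differences[male_id].append(
            male_citing_share[male_id] - male_citing_share[female_id]
        )
    return {male_id: mean(values) for male_id, values in differences.items()}


# Toy data: m1 is part of two pairs, m2 of one pair.
toy_share = {"m1": 80, "m2": 60, "f1": 60, "f2": 40}
toy_pairs = [("m1", "f1"), ("m1", "f2"), ("m2", "f1")]
toy_result = paper_level_differences(toy_pairs, toy_share)
```

Each male-authored focal paper then contributes exactly one value, regardless of the number of pairs it is part of.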

S2.2.8. Share of female-authored citing papers instead of male-authored citing papers
In our main analyses, we focused on the share of male-authored citing papers in order to assess the degree of gender homophily in citation decisions. Since gender homophily in citations is the preference to cite authors of the same gender, gender homophily can also be operationalized by the difference in the share of female-authored citing papers. If female authors were more likely to cite other female authors, the share of female-authored citing papers would differ between male-authored and female-authored focal papers. Fig. S16 shows the histograms for the differences in the share of female-authored citing papers for all pairs of focal papers from the Faculty Opinions data using Faculty Opinions keywords to control for paper similarity. Here, we calculated the differences as the share of female-authored citing papers for the male-authored focal paper minus the share for the female-authored focal paper. This means that negative values indicate gender homophily in citations.
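The sign convention described above can be illustrated with a minimal Python sketch. The function names and the toy data are hypothetical, and taking all listed citing papers as the denominator of the share is an assumption of this sketch.

```python
def citing_share(citing_genders, gender):
    """Percentage of citing papers whose author team has the given gender."""
    return 100 * citing_genders.count(gender) / len(citing_genders)


def female_share_difference(male_focal_citing, female_focal_citing):
    """Share of female-authored citing papers of the male-authored focal
    paper minus that of the female-authored focal paper (percentage points)."""
    return citing_share(male_focal_citing, "female") - citing_share(
        female_focal_citing, "female"
    )


# Toy pair: 20% female-authored citing papers for the male-authored focal
# paper, 40% for the female-authored focal paper.
toy_difference = female_share_difference(
    ["male"] * 8 + ["female"] * 2,
    ["male"] * 6 + ["female"] * 4,
)
```

The toy difference is negative, which is the direction that indicates gender homophily under this convention.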
Without controlling for the similarity between papers, the difference between male-authored and female-authored focal papers in the share of female-authored citing papers is smaller than in the share of male-authored citing papers (8.21 vs. 12.64 percentage points). This may be due to the generally small share of female-authored citing papers: if both papers of a pair have a relatively small share of female-authored citing papers, the difference between them cannot be large either. In line with the analyses using the share of male-authored citing papers, the difference in the share of female-authored citing papers decreases when controlling for the similarity between papers. Most of the gender differences in citations disappear after controlling for the similarity between papers, and there is only a small degree of gender homophily in citations left once topic similarity is controlled. The remaining absolute difference is 2.4 percentage points, which is close to the 1.6 percentage points when using male-authored citing papers (see Fig. 4). The only major contrast to the share of male-authored citing papers is as follows: when using the share of female-authored (instead of male-authored) citing papers, the first levels of topic similarity based on only one or two shared Faculty Opinions keywords are already sufficient to net out nearly all gender differences. The difference hardly shrinks further when controlling for more than two Faculty Opinions keywords. One reason for this result may be the smaller average difference in the share of female-authored citing papers when not controlling for similarity: if there is only a small average difference, it cannot get much smaller when (further) controlling for the similarity.
These robustness checks (based on the difference in the share of female-authored citing papers) also support our main finding that controlling for the similarity between papers is important, even though the granularity of this similarity is not as important as when using the share of male-authored citing papers.

[...] Table S11 for details on the calculation of the gender homophily rates.

[...] Table S2 for pairs of focal papers using the Faculty Opinions data. The dependent variable is the difference in the share of male-authored citing papers. Other predictor variables (difference in quality ratings, age, and team size) are set to zero.

Fig. S12. Histograms for the differences in the share of male-authored citing papers for pairs of focal papers (Faculty Opinions data, using titles and abstracts for measuring paper similarity and excluding papers with extreme gender distributions among citing papers). In each histogram, the pairs of focal papers are restricted to those cases in which one focal paper is authored only by male scientists and the other focal paper is authored only by female scientists. Positive differences result when the male-authored paper of a pair has a higher share of male-authored citing papers than the female-authored paper of this pair. The histograms differ in the minimum cosine similarity between the tf-idf of two paired papers, and, as a consequence, in [...]

Fig. S13. Histograms for the differences in the share of male-authored citing papers for pairs of focal papers (WoS data, using titles and abstracts for measuring paper similarity and excluding papers with extreme gender distributions among citing papers). In each histogram, the pairs of focal papers are restricted to those cases in which one focal paper is authored only by male scientists and the other focal paper is authored only by female scientists. Positive differences result when the male-authored paper of a pair has a higher share of male-authored citing papers than the female-authored paper of this pair. The histograms differ in the minimum cosine similarity between the tf-idf of two paired papers, and, as a consequence, in [...]

[...] In each histogram, the male-authored focal papers are restricted to those that could be paired with at least one female-authored focal paper. Positive average differences result when the share of male-authored citations is higher for the male-authored focal paper than the average share for its paired female-authored papers. The histograms differ in the minimum number of shared keywords that the pairs of focal papers have, and, as a consequence, in the number of male-authored focal papers included: all 9,280 papers in (A), 9,150 papers that could be paired based on at least one shared Faculty Opinions keyword in (B), 7,533 papers that could be paired based on at least two shared Faculty Opinions keywords in (C), 5,062 papers that could be paired based on at least three shared Faculty Opinions keywords in (D), 2,644 papers that could be paired based on at least four shared Faculty Opinions keywords in (E), and 1,110 papers that could be paired based on at least five shared Faculty Opinions keywords in (F). The vertical lines are placed at 0 (black) and at the observed average difference (red, dashed). The black curve shows the shape of a normal distribution.

[...] In each histogram, the female-authored focal papers are restricted to those that could be paired with at least one male-authored focal paper. Positive average differences result when the share of male-authored citations is smaller for the female-authored focal paper than the average share for its paired male-authored papers. The histograms differ in the minimum number of shared keywords that the pairs of focal papers have, and, as a consequence, in the number of female-authored focal papers included: all 1,261 papers in (A), 1,257 papers that could be paired based on at least one shared Faculty Opinions keyword in (B), 1,078 papers that could be paired based on at least two shared Faculty Opinions keywords in (C), 713 papers that could be paired based on at least three shared Faculty Opinions keywords in (D), 427 papers that could be paired based on at least four shared Faculty Opinions keywords in (E), and 210 papers that could be paired based on at least five shared Faculty Opinions keywords in (F). The vertical lines are placed at 0 (black) and at the observed average difference (red, dashed). The black curve shows the shape of a normal distribution.

Note. Standard errors in parentheses. * p < 0.05, ** p < 0.01, *** p < 0.001 (two-tailed test).